1  Abstract


This report is the final technical report for the project The Raw Fabric: A Technology
for Rapid Embedded-System Customization. The Raw fabric is a universal computational
substrate suitable for signal processing and embedded applications. The key innovation
behind Raw fabrics is the ability for software to customize chip-level communication
channels in an application-specific manner, thereby enabling the cost-effective construction
of mission-specific embedded systems. Raw fabrics offer the promise of orders-of-magnitude
improvements for embedded applications when compared to microprocessor-based systems.
These improvements are in performance, power, and size, and will allow system
customization to be measured in hours instead of years. Raw fabrics comprise single Raw chips
with on-chip customizable interconnect, and board-level systems containing many Raw chips.
Our project has built a Raw chip prototype and a handheld computer system based on Raw.
Our results demonstrate that Raw performs at or close to the level of the best specialized
machine for many application classes. When compared to a Pentium III implemented in the
same technology, Raw displays one to two orders of magnitude more performance for stream
applications, while performing within a factor of two for sequential desktop applications.

2  Overview  of  the  Project 

We  begin  the  report  by  providing  a  technical  overview  of  the  project. 

The Raw fabric [1] is a universal computational substrate that is suitable for signal
processing and embedded applications. The key innovation behind Raw fabrics is the ability
for software to customize chip-level communication channels in an application-specific
manner. Raw fabrics comprise single-chip Raw systems with on-chip customizable interconnect,
and board-level systems that comprise one to many Raw chips.

Raw fabrics address a major problem with both extant special-purpose hardware systems
and general-purpose machines. First, modern supercomputers, built from state-of-the-art
COTS microprocessors, have failed to eliminate the need for specialized hardware in the
signal processing and embedded domains. Although supercomputing systems have the highest
computational power, their inability to use this power cost-effectively for solving signal
processing problems has led to the proliferation of ASICs, FPGAs, DSPs, and full-custom
hardware. Second, the need for special-purpose hardware is even more acute in the embedded
application domain, where efficient utilization of area, weight, and power is paramount.
Unfortunately, the enormous cost and lengthy time-to-deployment of special-purpose hardware
systems significantly reduce their appeal.

The Raw fabric draws its design motivation from both the strengths and weaknesses
of using custom hardware for signal processing and embedded applications. There are three
main benefits to developing custom hardware. First, when processing a stream of data, the
ability to customize pipeline stages provides an order-of-magnitude performance improvement
over using a fixed pipeline when the energy budget is fixed. Second, custom hardware is able
to efficiently orchestrate direct data movement between pipeline stages. In contrast, a
fixed memory hierarchy with caches is very inefficient in handling certain access patterns,
such as stream data. Third, a custom design can tailor its resources to match both the level
and the granularity of the available parallelism in the application. This approach is more
efficient than using a processor that supports a fixed amount of parallelism. A fourth, minor
advantage of custom hardware is the ability to efficiently match the granularity of data
required by the application by customizing the size of registers, data paths, and ALUs.

Despite these advantages, custom hardware has a series of shortcomings. One of the
biggest drawbacks of custom hardware is its inflexibility. The inability to change
the applications that run on a given hardware platform dramatically reduces its
cost-effectiveness. Although FPGAs were partially successful in addressing this problem,
seamless reconfiguration during continuous operation is yet to be achieved. More importantly,
the inability of these devices to present an abstraction of unlimited resources renders the
task of mapping programs to them incredibly difficult. Because ASICs are application-specific
and cannot be applied to multiple problems, the design of a custom ASIC for an algorithm
can afford only a fraction of the development cost of a microprocessor design. Therefore,
it is not feasible to produce an ASIC with the same clock speed as a microprocessor of
the same generation. Furthermore, it becomes prohibitively expensive to design a custom
ASIC for each new process generation, while porting an application to a later version of a
microprocessor is relatively simple.

Our  project  consisted  of  four  major  components,  which  together  provide  a  complete 
polymorphous  computing  environment.  The  four  components  are  the  Raw  Processor,  the 
Raw  Fabric,  the  Raw  Compiler,  and  the  Raw  Operating  System.  As  described  shortly,  in 
each  of  the  components,  we  resolved  many  open  research  issues  and  technical  challenges. 

In summary, this project designed and built a flexible and scalable computation fabric
that can be morphed to solve many embedded applications in an energy-, area-, and
time-efficient manner. A major component of this research was the Raw processor. The Raw
processor is a simple tiled architecture with an innovative communication subsystem and is
an ideal building block for larger computation fabrics. As such, the Raw fabric is an early
proof-of-concept prototype of the general polymorphous computing architecture concept.

In the project, we investigated many novel micro-architectural features, compiler
algorithms, and operating system components. Each of these is central to a successful
polymorphous computing environment.

The project also completed the design of a multi-Raw-chip fabric and implemented an
optimized compiler and operating system for the Raw environment.

The project developed several prototype Raw systems that are now in use at two DARPA
sites, ISI and ATL at Lockheed Martin. Additional boards are being tested and will go to
several more of our DARPA collaborators.

3  Summary  of  Accomplishments  and  Systems  Built 

The following are the major components of our system that were implemented in our project.

1. Completion of fabrication of a Raw chip with an embedded on-chip customizable fabric

2. Completion of a single-board Raw system (referred to as the “handheld system”) for
single-chip embedded applications, including

(a) Completion of the implementation of a handheld communicator system for
standalone general-purpose computation

(b) Design of an embedded wireless processor

(c) Design of an embedded networking fabric and control plane

(d) Implementation of a 32-node acoustic beamformer, which is being extended to 1K nodes

3. Design of a multi-Raw-chip fabric system

4. Design of a PCI interface

5. Optimization of a stream language and compiler

6. A beamformer microphone array design


4  Accomplishments  and  Progress 

This  section  provides  specific  details  of  our  accomplishments. 

We  built  a  working  prototype  of  a  Polymorphous  Computing  Architecture  platform 
based  on  the  Raw  infrastructure.  The  following  sections  discuss  the  components  in  detail. 

4.1  The  Raw  Chip 

The  Raw  processor  was  a  major  system  that  we  built.  We  implemented  the  prototype  Raw 
microprocessor  in  the  SA-27E  ASIC  flow,  which  uses  IBM’s  CMOS  7SF,  a  180nm,  6-layer 
copper  process.  We  received  120  chips  from  IBM  in  October  of  2002.  We  are  pleased  to 
report  that  there  were  no  bugs  in  first  silicon. 
Figure 1 shows a micro-photograph of the Raw die. The 16-tile geometry of the chip
can be clearly made out.

Figure 1: Photo of the Raw chip.


4.2  The  Raw  Handheld  Board 

We validated the Raw processor on a prototype motherboard called the Raw handheld board.
We built several such boards along with our collaborators at ISI. Boards are in use at ISI and
ATL (Lockheed Martin). Each board contains the Raw chip, SDRAM chips, I/O interfaces,
and interface FPGAs.

Figure  2  shows  a  photograph  of  the  Raw  motherboard. 

We have also designed an embedded wireless processor, an embedded networking fabric
and control plane, and a 32-node acoustic beamformer (to be extended to 1K nodes in the
following year).

We also implemented a high-speed USB interface for the motherboard. This allows the
board to be connected to any laptop or PC with a USB 2.0 interface.

We  are  nearing  completion  of  the  PCI  interface,  which  will  allow  the  Raw  motherboard 
to  function  as  a  standalone  system  and  provide  even  higher  speed  I/O. 

4.3  The  Multi-Chip  Raw  Fabric 

We also designed the Raw fabric board and the Raw fabric I/O board (along with our
collaborators at ISI). The fabric board contains 4 Raw chips and can be connected in a mesh
with other fabric boards. The I/O board plugs into the periphery of the fabric board
mesh and provides I/O, memory, and other expansion functions.

In  the  next  phase  we  will  implement  and  test  these  boards  and  build  a  fabric  system 
containing  64  Raw  chips. 

4.4  The  RawCC  Compiler  and  the  Stream  Compiler 

We built RawCC, which takes sequential C or FORTRAN programs and compiles them onto
the Raw fabric. We developed the analysis necessary to extract ILP (instruction-level
parallelism) out of sequential programs. Thus, programs written in the languages supported
by SUIF (Stanford University Intermediate Format), namely FORTRAN and C, are able to use
our compiler.
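
As a rough illustration of what "available ILP" means for a sequential program, the sketch below measures a small dependence graph: it finds the longest chain of dependent operations and divides the operation count by that length. This is not RawCC's analysis. The graph `DEPS`, the unit-latency assumption, and the function names are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Hypothetical dependence graph for the straight-line code
#   t1 = a*b; t2 = c*d; t3 = t1+t2; t4 = e*f; t5 = t3+t4
# Each key is an operation; its value lists the operations it depends on.
DEPS = {
    "t1": [],
    "t2": [],
    "t4": [],
    "t3": ["t1", "t2"],
    "t5": ["t3", "t4"],
}


def critical_path_length(deps):
    """Length of the longest dependence chain, assuming unit-latency ops."""
    @lru_cache(maxsize=None)
    def depth(op):
        # An op with no predecessors has depth 1.
        return 1 + max((depth(p) for p in deps[op]), default=0)

    return max(depth(op) for op in deps)


def available_ilp(deps):
    """Average operations per step if execution resources were unlimited."""
    return len(deps) / critical_path_length(deps)


if __name__ == "__main__":
    print(critical_path_length(DEPS))  # longest chain is t1 -> t3 -> t5
    print(available_ilp(DEPS))
```

In this toy graph five operations sit on a three-step critical path, so at most about 1.7 operations per step can overlap, regardless of how many tiles are available.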

We also built a complete streaming language and compiler called StreamIt. The language
syntax is like that of Java and allows the user to express stream programs effectively.
StreamIt and the compiler have been distributed to the DARPA community.

Figure 2: Photo of the Raw prototype motherboard. The board is 13 inches by 13 inches.

4.5  The  Raw  OS 

We developed a prototype host-based operating system for the Raw fabric. Our initial fabric
operating system includes features needed to support I/O devices and OS services expected
by applications. We developed and deployed a nano-kernel for each tile. A set of POSIX
commands has been implemented using a few dedicated OS tiles and cooperating
nano-kernels. We have distributed this OS to the DARPA community.

We also developed RawGDB, a debugger for Raw. RawGDB has also been released to
external users.

We also implemented the compiler and runtime system needed for the software-supported
instruction cache and for SUDS [2] (software undo system).

4.6  Applications,  Experimentation  and  Evaluation 

We performed a substantial amount of experimentation with applications on the real Raw
system. We also validated our simulator against the real hardware and conducted further
experiments. We used our working RawCC compiler, the stream compiler, and the Raw OS for
this task. ISI and Lincoln Labs have also developed several PCA applications on Raw.

Specifically, here are some highlights. The domains we examined include ILP
computation, and stream and embedded computation. The performance of Raw in each of these
areas is presented as a comparison to a reference 600 MHz Pentium III, because the Pentium
III was manufactured using the same 180 nm technology as Raw.

We  note  that  Raw  achieves  greater  than  16x  speedup  (versus  a  single  tile)  for  several 
applications  (listed  in  6).  Table  1  discusses  the  various  factors  that  helped  Raw. 

Factor responsible                                         Maximum Speedup
Tile parallelism (Exploitation of Gates)                   16x
Load/store elimination (Management of Wires)               3x
Streaming mode vs cache thrashing (Management of Wires)    60x
Streaming I/O bandwidth (Management of Pins)               60x
Increased cache/register size (Exploitation of Gates)      ~2x
Bit Manipulation Instructions (Specialization)             3x

Table 1: Sources of speedup for Raw over P3 (as configured in Table 3).

Tables 2 and 3 show functional unit timings and memory system characteristics for both
systems, respectively. Table 4 shows Raw’s measured power consumption. Table 5 lists a
breakdown of the end-to-end message latency on Raw’s scalar operand network. The low
3-cycle inter-tile ALU-to-ALU latency and zero-cycle send and receive occupancies are
critical for obtaining good performance for ILP.

Table 2: Functional unit timings on a single Raw tile and on a P3. Commonly executed
instructions appear first. FP operations are single precision.

                          1 Raw Tile    P3
CPU Frequency             425 MHz       600 MHz
Sustained Issue Width     1 in-order    3 out-of-order
Mispredict Penalty        3             10-15
DRAM Freq (RawPC)         100 MHz       100 MHz
DRAM Freq (RawStreams)    425 MHz       100 MHz
DRAM Access Width         8 bytes       8 bytes
L1 D cache size           32K           16K
L1 I cache size           32K           16K
L1 miss latency           54 cycles     7 cycles
L1 fill width             4 bytes       32 bytes
L1 line sizes             32 bytes      32 bytes
L1 associativities        2-way         4-way
L2 size                   -             256K
L2 associativity          -             8-way
L2 miss latency           -             79 cycles
L2 fill width             -             8 bytes

Table 3: Memory system data for Raw tile and P3.

                             Core      Pins
Idle - Full Chip             9.6 W     0.02 W
Average - Per Active Tile    0.54 W    -
Average - Per Active Port    -         0.2 W
Average - Full Chip          18.2 W    2.8 W

Table 4: Raw power consumption at 425 MHz, 25°C.
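
The entries of Table 4 appear to be consistent with a simple additive model: idle power plus a fixed increment per active tile (core) or per active port (pins). The sketch below checks that reading against the full-chip figures. The linear model itself is an assumption, as is the count of 14 active ports used in the pin example; the 16-tile count comes from the chip geometry described in Section 4.1.

```python
# Values copied from Table 4 (425 MHz, 25 C).
IDLE_CORE_W = 9.6
PER_ACTIVE_TILE_W = 0.54
IDLE_PINS_W = 0.02
PER_ACTIVE_PORT_W = 0.2


def core_power(active_tiles):
    """Estimated core power in watts, assuming a linear per-tile model."""
    return IDLE_CORE_W + active_tiles * PER_ACTIVE_TILE_W


def pin_power(active_ports):
    """Estimated pin power in watts, assuming a linear per-port model."""
    return IDLE_PINS_W + active_ports * PER_ACTIVE_PORT_W


if __name__ == "__main__":
    # All 16 tiles active: close to the 18.2 W full-chip average.
    print(round(core_power(16), 1))  # 18.2
    # 14 active ports is an assumed figure, not stated in this report.
    print(round(pin_power(14), 1))   # 2.8
```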


                                      Latency
Sending Processor Occupancy           0
Latency to Network Input              1
Latency per hop                       1
Latency from Network Output to ALU    1
Receiving Processor Occupancy         0

Table 5: Breakdown of the end-to-end latency (in cycles) for a one-word message on Raw’s
static network.
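
The components in Table 5 can be summed to estimate the ALU-to-ALU latency between any two tiles. The sketch below does this under the assumption that the hop count equals the Manhattan distance between tile coordinates on a 4x4 grid; the coordinate scheme and the function name are illustrative and not taken from the report.

```python
# Per-stage latencies in cycles, copied from Table 5.
SEND_OCCUPANCY = 0
TO_NETWORK_INPUT = 1
PER_HOP = 1
NETWORK_OUTPUT_TO_ALU = 1
RECEIVE_OCCUPANCY = 0


def operand_latency(src, dst):
    """Cycles for a one-word message between two distinct tiles.

    src and dst are (row, col) tile coordinates. The hop count is taken to be
    the Manhattan distance, which is an assumption about the routing.
    """
    if src == dst:
        raise ValueError("src and dst must be different tiles")
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return (SEND_OCCUPANCY + TO_NETWORK_INPUT + hops * PER_HOP
            + NETWORK_OUTPUT_TO_ALU + RECEIVE_OCCUPANCY)


if __name__ == "__main__":
    print(operand_latency((0, 0), (0, 1)))  # neighbouring tiles: 3
    print(operand_latency((0, 0), (3, 3)))  # opposite corners of a 4x4 grid: 8
```

For neighbouring tiles the sum is 3 cycles, which matches the inter-tile ALU-to-ALU latency quoted in the text.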

Much like a VLIW (very long instruction word) architecture, Raw relies on the compiler
to find and exploit ILP. We now examine how well Raw is able to support ILP. For this
evaluation, we select a range of benchmarks that encompasses a wide spectrum of program
types and degrees of ILP. For some of the irregular integer benchmarks that RawCC is not
yet mature enough to orchestrate, we compile and execute them on one tile to get a
conservative worst-case bound on their performance on Raw. Table 6 presents the performance
of these benchmarks on RawPC and on the P3.

Of the benchmarks in our study, Raw is able to outperform the P3 for all the scientific
benchmarks and several irregular applications. Of these, about half have speedups in the
2-3 range, but the other half have more promising speedups in the 4-7 range. At the other
end of the spectrum, for the integer applications run on a single Raw tile, our sampling of
applications showed that a Raw tile is roughly a factor of 2 slower than the P3.


                                                                 Speedup vs P3
Benchmark       Source                 Raw Tiles  Cycles on Raw  by Cycles  by Time

Dense-Matrix Scientific Applications
Swim            Spec95                 16         58M            4.0        2.8
Tomcatv         Spec92                 16         3.2M           1.9        1.4
Btrix           Nasa7:Spec92           16         4.6M           5.5        3.9
Cholesky        Nasa7:Spec92           16         5.5M           2.9        2.5
Mxm             Nasa7:Spec92           16         2.0M           3.5        2.5
Vpenta          Nasa7:Spec92           16         2.5M           10.3       7.3
Jacobi          Raw benchmark suite    16         150K           6.4        4.5
Life            Raw benchmark suite    16         4.0M           7.4        5.2

Sparse-Matrix/Integer Applications
Fpppp-kernel    Spec92                 16         150K           11.2       7.9
SHA             Perl Oasis             16         920K           1.9        1.3
Unstructured    CHAOS                  16         15M            1.1        0.75
Adpcm           Mediabench             1          20M            0.85       0.60
GSM             Mediabench             1          310M           0.57       0.40
175.vpr         Spec 2000              1          2.9B           0.71       0.51
300.twolf       Spec 2000              1          2.3B           0.56       0.40

Table 6: Performance of sequential programs on Raw and on a P3.
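
The two speedup columns of Table 6 differ because the two processors run at different clock rates (425 MHz for Raw and 600 MHz for the P3, per Table 3). The sketch below shows one plausible way to relate them: scale the cycle-count speedup by the ratio of clock frequencies. This relationship is inferred from the table rather than stated in the report, and some rows do not match it exactly.

```python
# Clock frequencies from Table 3.
RAW_MHZ = 425.0
P3_MHZ = 600.0


def speedup_by_time(speedup_by_cycles):
    """Convert a cycle-count speedup into a wall-clock speedup.

    Raw needs fewer cycles but each cycle is longer, so the cycle speedup
    is scaled down by the clock ratio (an inferred relationship).
    """
    return speedup_by_cycles * RAW_MHZ / P3_MHZ


if __name__ == "__main__":
    print(round(speedup_by_time(4.0), 1))   # Swim: 2.8
    print(round(speedup_by_time(10.3), 1))  # Vpenta: 7.3
```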

Table 7 shows the speedups achieved by RawCC as the number of tiles varies from two to
16. The speedups are relative to the performance of a single Raw tile. Overall, the
speedups come primarily from tile parallelism (see Table 1), but several of the dense-matrix
benchmarks also benefit from increased cache capacity with parallel access (which
explains the super-linear speedups). In addition, Fpppp-kernel benefits from increased
register capacity, which leads to fewer spills.


                  Number of tiles
Benchmark         2      4      8      16

Dense-Matrix Scientific Applications
Swim              1.4    2.2    4.3    7.9
Tomcatv           1.6    3.0    4.2    4.8
Btrix             1.6    4.5    10.3   21.8
Cholesky          1.9    3.7    6.2    6.1
Mxm               1.3    3.7    5.7    7.0
Vpenta            1.6    4.9    11.3   22.0
Jacobi            1.7    4.2    8.2    16.5
Life              0.8    2.0    4.9    10.5

Sparse-Matrix/Irregular Applications
SHA               1.1    1.8    1.9    2.3
Fpppp-kernel      1.4    3.3    6.5    7.4
Unstructured      1.2    2.0    2.1    2.0

Table 7: Speedup of the ILP benchmarks relative to the single-tile Raw, from two to 16 tiles.
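
To make the notion of super-linear speedup concrete, the sketch below computes parallel efficiency (speedup divided by tile count) and lists the benchmarks whose 16-tile speedup in Table 7 exceeds 16. The dictionary simply copies the last column of Table 7; the function names are illustrative.

```python
# 16-tile speedups copied from the last column of Table 7.
SPEEDUP_16 = {
    "Swim": 7.9, "Tomcatv": 4.8, "Btrix": 21.8, "Cholesky": 6.1,
    "Mxm": 7.0, "Vpenta": 22.0, "Jacobi": 16.5, "Life": 10.5,
    "SHA": 2.3, "Fpppp-kernel": 7.4, "Unstructured": 2.0,
}


def parallel_efficiency(speedup, tiles):
    """Fraction of ideal linear speedup achieved on the given tile count."""
    return speedup / tiles


def super_linear(speedups, tiles):
    """Names of benchmarks whose speedup exceeds the number of tiles."""
    return sorted(name for name, s in speedups.items() if s > tiles)


if __name__ == "__main__":
    print(super_linear(SPEEDUP_16, 16))  # ['Btrix', 'Jacobi', 'Vpenta']
    print(parallel_efficiency(SPEEDUP_16["Swim"], 16))
```

The three benchmarks flagged here are all dense-matrix codes, in line with the explanation that added cache capacity contributes to their speedup.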

Next, we present the performance of stream computations on Raw. Stream computations
arise naturally out of real-time I/O applications as well as from embedded applications. The
data sets for these applications are often large and may even be a continuous stream in
real time, which makes them unsuitable for traditional cache-based memory systems. Raw
provides more natural support for stream-based computation by allowing data to be fetched
efficiently through a register-mapped, software-orchestrated network.

The following results are for programs written in StreamIt and automatically compiled
to Raw. StreamIt is a high-level, architecture-independent language for high-performance
streaming applications. StreamIt contains language constructs that improve programmer
productivity for streaming, including hierarchical structured streams, graph
parameterization, and circular buffer management; these constructs also expose information
to the compiler and enable novel optimizations. We have developed a Raw backend for the
StreamIt compiler, which includes fully automatic load balancing, graph layout,
communication scheduling, and routing.
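
The idea of a stream program built from filters connected in a pipeline can be sketched in a few lines. The example below is written in Python, not StreamIt, and does not reproduce StreamIt syntax or its compiler; the names `fir_filter`, `scale`, and `pipeline` are hypothetical. It shows a sliding-window filter (using a bounded deque as a circular buffer) feeding a second stage.

```python
from collections import deque


def fir_filter(weights):
    """Return a stage that emits a weighted sum over a sliding window."""
    def stage(stream):
        # A bounded deque acts as a circular buffer of the latest inputs.
        window = deque(maxlen=len(weights))
        for item in stream:
            window.append(item)
            if len(window) == len(weights):
                yield sum(w * x for w, x in zip(weights, window))
    return stage


def scale(factor):
    """Return a stage that multiplies every item by a constant."""
    def stage(stream):
        for item in stream:
            yield factor * item
    return stage


def pipeline(*stages):
    """Compose stages so each consumes the output of the previous one."""
    def run(stream):
        for s in stages:
            stream = s(stream)
        return stream
    return run


if __name__ == "__main__":
    program = pipeline(fir_filter([1.0, 1.0, 1.0]), scale(2.0))
    print(list(program([1, 2, 3, 4, 5])))  # [12.0, 18.0, 24.0]
```

Each stage keeps only local state and communicates solely through its input and output streams, which is the property that lets a compiler place stages on separate tiles and schedule the communication between them.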

We evaluate the performance of RawPC on several StreamIt benchmarks, which represent
large and pervasive DSP applications. Table 8 summarizes the performance of 16 Raw
tiles vs. a P3. For both architectures, we use StreamIt versions of the benchmarks; we do
not compare to hand-coded C on the P3 because StreamIt performs at least 1-2X better for
5 of the 7 applications (this is due to aggressive unrolling and constant propagation in the
StreamIt compiler). The comparison reflects two distinct influences: 1) the scaling of Raw
performance as the number of tiles increases, and 2) the performance of a Raw tile vs. a P3
for the same StreamIt code. To distinguish between these influences, Table 9 shows detailed
speedups relative to StreamIt code running on a 1-tile Raw configuration.

                StreamIt    Raw configuration (tiles)
Benchmark       on P3       1      2      4      8
Beamformer      2.6         1.0    3.7    4.0    8.
Bitonic Sort    .2          1.0    1.9    3.4    5.3
FFT             1.0         1.0    1.3    1.7    2.4
Filterbank      0.72        1.0    1.3    1.3    3.4
FIR             3.4         1.0    2.3    5.6    12
FMRadio         .2          1.0    1.0
Matrix Mult                 1.0    2.0    2.9    2.8

Table 9: Speedup (in cycles) of StreamIt benchmarks relative to a 1-tile Raw configuration.
From left, the columns indicate the StreamIt version on a P3, and on Raw configurations
with one to 16 tiles.

The primary result illustrated by Table 9 is that StreamIt applications scale effectively
for increasing sizes of the Raw configuration. For FIR, FFT, and Bitonic Sort, the scaling is
approximately linear across all tile sizes (FIR is actually super-linear due to decreasing
register pressure in larger configurations). For other applications, the scaling is slightly
inhibited for small configurations. This is because 1) IMEM constraints prevent an unrolling
optimization for small tile sizes (Beamformer, FM, Matrix Mult) and 2) there is some constant
overhead that is amortized in larger configurations.

The second influence is the performance of a P3 vs. a single Raw tile on the same
StreamIt code, as illustrated by the second column in Table 9. In most cases, performance is
comparable. The P3 performs better in two cases because it can exploit ILP: Beamformer has
independent real/imaginary updates in the inner loop, and FIR is a fully unrolled
multiply-accumulate operation. In other cases, ILP is obscured by circular buffer accesses
and control dependences.

In all, StreamIt applications benefit from Raw’s exploitation of parallel resources and
management of wires. The abundant parallelism and regular communication patterns in
stream programs are an ideal match for the parallelism and tightly orchestrated
communication on Raw. As stream programs often require high bandwidth, register-mapped
communication serves to avoid costly memory accesses. Also, autonomous streaming components
can manage their local state in Raw’s distributed data caches and register banks, thereby
improving locality. These aspects are key to the scalability demonstrated in the StreamIt
benchmarks.

4.7  Support  in  Standardizing  a  Morphware  Stable  Interface 

Our  project  has  played  an  active  role  in  the  Morphware  Forum. 

Specifically, we have taken a leadership role in specifying the Streaming Virtual Machine
(SVM), which will provide a common interface for high-level compilation tools to target all
PCA architectures. In order to achieve high performance for streaming applications, the SVM
embraces a separation of control and data-intensive code, explicit communication via streams,
and explicit memory management for streaming data. It also adopts a novel two-stage
compilation strategy whereby high-level tools input a description of the target architecture
(using the PCA Machine Model) and compile to a version of the SVM that is parameterized for
that architecture. Reaching consensus on this interface has involved detailed design
discussions with many of the PCA teams, including Stanford, Raytheon, Georgia Tech, USC, and
the University of Texas. Along with Reservoir Labs, we have coordinated the design process
and have produced a stable specification that is serving as a cornerstone of the Morphware
toolchain. We have also expressed the Raw architecture in terms of the Machine Model, and
are actively engaged in the evaluation of the Reservoir High-Level Compiler that targets
the SVM.

5  Conclusions  and  Recommendations  for  Future  Work 

This  report  described  the  architecture  and  implementation  of  the  Raw  prototype.  Raw’s 
exposed  ISA  (instruction  set  architecture)  allows  parallel  applications  to  exploit  all  of  the 
chip  resources,  including  gates,  wires  and  pins.  Raw  supports  ILP  by  scheduling  operands 
over  a  scalar  operand  network  that  offers  very  low  latency  for  scalar  data  transport.  Raw’s 
compiler  manages  the  effect  of  wire  delays  by  orchestrating  both  scalar  and  stream  data 
transport.  The  Raw  processor  demonstrates  that  existing  architectural  abstractions  like 
interrupts,  caches,  and  context-switching  can  continue  to  be  supported  in  this  environment, 
even  as  applications  take  advantage  of  the  low-latency  scalar  operand  network  and  the  large 
number  of  ALUs. 

Our results demonstrate that the Raw processor performs at or close to the level of
the best specialized machine for each application class. When compared to a Pentium III,
Raw displays one to two orders of magnitude more performance for stream applications,
while performing within a factor of two for low-ILP applications. It is our hope that the
Raw research will provide insight for architects who are looking for ways to build versatile
processors that leverage the vast silicon resources while mitigating the considerable wire
delays that loom on the horizon.

Our effort has pointed to several future directions that are worth exploring: (1)
evaluating the performance for much larger numbers of tiles and a wider set of programs, (2)
generalizing on-chip operand networks so that they support other forms of parallelism
exploited by microprocessors, such as stream parallelism and thread parallelism, (3)
completing the design and evaluation of both dynamic and compile-time schemes for
operation/operand assignment and scheduling, (4) mechanisms for fast exception handling and
context switching, (5) a thorough analysis of the tradeoffs between commit point, exception
handling capability, and network latency, (6) low-energy tiled processors and scalar operand
networks, and (7) tiled architectures with support for bit-level processing.

6  Key  Publications 

Our  major  publications  are  these:  [3],  [1],  [4],  [5],  [6],  [7],  [8],  [9],  [10],  [11],  [12],  [13],  [14], 
[15],  [16],  [17],  [18]. 